New York City is full of urban wildlife, and rats are one of the city’s most infamous animals. Rats in NYC are plentiful, and in September 2015 a viral video of a brown rat carrying a slice of pizza down the steps of a Manhattan subway station created the legend of Pizza Rat.
The data on rat sightings in the city comes from NYC’s 311
Service Requests from 2010 to Present. This dataset is
updated more or less constantly; I downloaded this copy on
11-Oct-2022. For this group project, you will use R,
dplyr, and ggplot2 to explore and explain the
data, and tell an interesting story. The data fields are shown below and
a full data dictionary can be found here.
Rows: 204,232
Columns: 45
$ unique_key <dbl> 40177738, 54095783, 52787670, 52046514,…
$ created_date <dttm> 2018-09-03 12:21:09, 2022-05-05 15:02:…
$ closed_date <chr> "09/14/2018 05:46:05 PM", "05/05/2022 0…
$ agency <chr> "DOHMH", "DOHMH", "DOHMH", "DOHMH", "DO…
$ agency_name <chr> "Department of Health and Mental Hygien…
$ complaint_type <chr> "Rodent", "Rodent", "Rodent", "Rodent",…
$ descriptor <chr> "Rat Sighting", "Rat Sighting", "Rat Si…
$ location_type <chr> "3+ Family Apt. Building", "Other (Expl…
$ incident_zip <dbl> 11229, 11216, 11201, 11222, 10009, 1141…
$ incident_address <chr> "1213 AVENUE U", "1228 DEAN STREET", "2…
$ street_name <chr> "AVENUE U", "DEAN STREET", "WYCKOFF STR…
$ cross_street_1 <chr> "EAST 12 STREET", "WILLIAM BILL HOWARD …
$ cross_street_2 <chr> "HOMECREST AVENUE", "NEW YORK AVENUE", …
$ intersection_street_1 <chr> "EAST 12 STREET", "WILLIAM BILL HOWARD …
$ intersection_street_2 <chr> "HOMECREST AVENUE", "NEW YORK AVENUE", …
$ address_type <chr> "ADDRESS", "ADDRESS", "ADDRESS", "INTER…
$ city <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", NA,…
$ landmark <chr> NA, "DEAN STREET", "WYCKOFF STREET", NA…
$ facility_type <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ status <chr> "Closed", "Closed", "Closed", "In Progr…
$ due_date <chr> "10/03/2018 12:12:40 PM", NA, NA, NA, "…
$ resolution_action_updated_date <chr> "09/14/2018 05:46:05 PM", "05/05/2022 0…
$ community_board <chr> "15 BROOKLYN", "08 BROOKLYN", "02 BROOK…
$ borough <chr> "Brooklyn", "Brooklyn", "Brooklyn", "Br…
$ x_coordinate_state_plane <dbl> 995446, 998524, 986303, 999634, 990213,…
$ y_coordinate_state_plane <dbl> 157321, 185855, 189477, 203643, 204768,…
$ park_facility_name <chr> "Unspecified", "Unspecified", "Unspecif…
$ park_borough <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", "BR…
$ vehicle_type <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxi_company_borough <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxi_pick_up_location <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_direction <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ road_ramp <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_segment <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latitude <dbl> 40.59848, 40.67679, 40.68675, 40.72562,…
$ longitude <dbl> -73.95968, -73.94854, -73.99260, -73.94…
$ location <chr> "(40.598478991333735, -73.9596835550102…
$ sighting_date <date> 2018-09-03, 2022-05-05, 2021-12-13, 20…
$ sighting_year <dbl> 2018, 2022, 2021, 2021, 2018, 2018, 201…
$ sighting_month <dbl> 9, 5, 12, 10, 7, 7, 7, 7, 12, 12, 6, 4,…
$ sighting_month_name <ord> September, May, December, October, July…
$ sighting_day <int> 3, 5, 13, 1, 12, 12, 15, 24, 26, 28, 6,…
$ sighting_weekday <ord> Monday, Thursday, Monday, Friday, Thurs…
$ sighting_hour <int> 12, 15, 14, 18, 22, 9, 15, 13, 18, 12, …
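Note in the glimpse above that closed_date is still plain text like "09/14/2018 05:46:05 PM". As a minimal illustration (separate from the starter code below), lubridate’s parsers can handle such strings:

```r
library(lubridate)

# Month/day/year with a 12-hour clock: mdy_hms() converts PM to 24-hour time
x <- mdy_hms("09/14/2018 05:46:05 PM")
hour(x)   # 17

# Sibling parsers cover other orderings of the same components
ymd("2018-09-14")   # year-month-day
dmy("14/09/2018")   # day-month-year
```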
I have provided some starter code below. A few comments about it:

- vroom() treats cells that are empty or “NA” as missing values. This rat dataset uses “N/A” to mark missing values, so we need to add that as a possible marker of missingness (hence na = c("", "NA", "N/A")).
- We use janitor::clean_names() to remove spaces, upper case letters, etc. from variable names.
- The code creates a few helper date variables (sighting_year, sighting_month, sighting_day, and sighting_weekday). You don’t have to use them, but they’re there if you need them. The functions that create these, like year() and wday(), are part of the lubridate library.
- Dates are stored as text like 09/14/2018 05:46:05 PM, which R is not able to automatically parse as a date when reading the CSV file. You can use the mdy_hms() function in the lubridate library to parse dates that are structured as “month-day-year-hour-minute-second”. There are also a bunch of other iterations of this function, like ymd(), dmy(), etc., for other date formats.
- We use str_to_title() to turn the upper case borough into a more legible format, from MANHATTAN to Manhattan and from STATEN ISLAND to Staten Island.

library(tidyverse)
library(lubridate)
# If you get an error "All formats failed to parse. No formats found",
# it's because the mdy_hms function couldn't parse the date. The date
# variable *should* be in this format: "09/14/2018 05:46:05 PM", but in some
# rare instances, it might load without the seconds as "09/14/2018 05:00 PM".
# If there are no seconds, use mdy_hm() instead of mdy_hms().
rats <- vroom::vroom(here::here("data/Rat_Sightings.csv.zip"), na = c("", "NA", "N/A")) %>%
janitor::clean_names() %>%
mutate(created_date = mdy_hms(created_date)) %>%
mutate(sighting_date = as.Date(created_date), # just the date without hour info
sighting_year = year(created_date),
sighting_month = month(created_date),
sighting_month_name = month(created_date, label = TRUE, abbr = FALSE),
sighting_day = day(created_date),
sighting_weekday = wday(created_date, label = TRUE, abbr = FALSE)
) %>%
filter(borough != "Unspecified") %>%
mutate(borough = str_to_title(borough))

You’ll summarise the data with functions from dplyr,
including stuff like count(), arrange(),
filter(), group_by(),
summarise(), and mutate(). Here are some
examples of ways to summarise the data:
# See the count of rat sightings by weekday
rats %>%
count(sighting_weekday)
# Assign a summarised data frame to an object to use it in a plot
rats_by_weekday <- rats %>%
count(sighting_weekday, sighting_year)
ggplot(rats_by_weekday, aes(x = fct_rev(sighting_weekday), y = n)) +
geom_col() +
coord_flip() +
facet_wrap(~ sighting_year)
# See the count of rat sightings by weekday and borough
rats %>%
count(sighting_weekday, borough, sighting_year)
# An alternative to count() is to specify the groups with group_by() and then
# be explicit about how you're summarising the groups, such as calculating the
# mean, standard deviation, or number of observations (we do that here with
# `n()`).
rats %>%
group_by(sighting_weekday, borough) %>%
summarise(n = n())

In the R4DS Exploratory Data Analysis chapter, the authors state:
“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation…EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”
Conduct a thorough EDA. Recall that an EDA involves three things:

- Looking at the raw values, with functions like dplyr::glimpse() or skimr::skim()
- Computing summary statistics, with functions like corrr::correlate() or mosaic::favstats()
- Creating visualisations with ggplot2::ggplot(), for example:
  - geom_histogram() or geom_density() for numeric continuous variables
  - geom_bar() or geom_col() for categorical variables
  - GGally::ggpairs() for a scatterplot/correlation matrix
  - Extra aesthetics can be mapped inside the aes() call, for example: aes(colour = borough, alpha = 0.4)

You may wish to have a level 1 header (#) for your EDA,
then use level 2 sub-headers (##) to make sure you cover
all three EDA bases. At a minimum you should answer
these questions:

At this stage, you may also find you want to use filter,
mutate, arrange, select, or
count. Let your questions lead you!
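A minimal sketch of these EDA steps, run here on a tiny made-up stand-in (rats_demo) rather than the real rats dataframe:

```r
library(tidyverse)

# Tiny made-up stand-in for the real `rats` dataframe
rats_demo <- tibble(
  borough       = c("Brooklyn", "Brooklyn", "Queens", "Manhattan", "Queens"),
  sighting_hour = c(12, 15, 9, 22, 18)
)

# 1. Look at the raw values
glimpse(rats_demo)

# 2. Compute summary statistics
rats_demo %>%
  group_by(borough) %>%
  summarise(n = n(), mean_hour = mean(sighting_hour))

# 3. Visualise, mapping extra aesthetics inside aes()
ggplot(rats_demo, aes(x = borough, fill = borough)) +
  geom_bar(alpha = 0.4)
```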
In all cases, please think about the message your plot is conveying. Don’t just describe the axes (“this is my X-axis, this is my Y-axis”); instead, explain the “so what” of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.
For each table, make sure to include a relevant figure. One tip for
starting is to draw out on paper what you want your x- and y-axis to be
first and what your geom is; that is, start by drawing the
plot you want ggplot to give you.
Your figure does not have to depict every last number from the data aggregation result. Use your judgement. It just needs to complement the table, add context, and allow for some sanity checking both ways.
Notice which figures are easy/hard to make, which data formats make better inputs for plotting functions vs. for human-friendly tables.
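For instance (with toy numbers, purely illustrative), a counts table and its companion figure can share the same tidy input:

```r
library(tidyverse)

# Illustrative counts; with the real data this table would come from
# count() or group_by() + summarise()
sightings_by_borough <- tibble(
  borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
  n       = c(120, 180, 150, 90, 20)
)

# One row per bar: the same frame feeds both the printed table and the plot
ggplot(sightings_by_borough, aes(x = fct_reorder(borough, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Rat sightings")
```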
Visualisations of feature distributions and their relations are key
to understanding a data set, and they can open up new lines of
exploration. While we do not have time to go into all the wonderful
geospatial visualisations one can do with R, you can use the following
code to start with a map of your city, and overlay all rat sighting
coordinates to get an overview of the spatial distribution of reported
rat activity. For this visualisation we use the leaflet
package, which includes a variety of tools for interactive maps, so you
can easily zoom in and out, and click on a point to get the actual
location type and date/time of the sighting. If you want, you can learn
more about leaflet, the package that draws the interactive map, by
following the relevant Datacamp course on mapping with leaflet.
library(leaflet)

# let's get the top 7 location types, which account for > 90% of all cases
# this code generates a vector with the top 7 location types
top_location_types <- rats %>%
count(location_type, sort=TRUE) %>%
mutate(perc = 100*n/sum(n)) %>%
slice(1:7) %>%
select(location_type) %>%
pull()
# lets us choose how to colour each point. What palette and colours to use?
# A great site to get the relevant color hex codes for maps is
# https://colorbrewer2.org/#type=qualitative&scheme=Set1&n=7
my_colours <- c('#e41a1c','#377eb8','#4daf4a','#984ea3','#ff7f00','#ffff33','#a65628')
# create a function point_fill, that assigns `my_colours` to different location types
# you can read more here https://rstudio.github.io/leaflet/colors.html
point_fill <- colorFactor(palette = my_colours,
rats$location_type)
rats %>%
filter(sighting_year == 2021) %>% # just show 2021 data
filter(location_type %in% top_location_types) %>%
leaflet() %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~point_fill(location_type),
fillOpacity = 0.6,
popup = ~created_date,
label = ~location_type) %>%
addLegend("bottomright", pal = point_fill,
values = ~location_type,
title = "2021 Location Type",
          opacity = 0.5)

Summarise the data somehow. The raw data has more than 200,000 rows,
which means you’ll need to aggregate the data (filter(),
group_by(), and summarise() are your friends).
Consider looking at the number of sightings per borough, per year, per
dwelling type, etc., or a combination of these, like the change in the
number of sightings across the 5 boroughs between, say, 2014 and 2021.
You can recreate one of these ugly, less-than-helpful graphs, or create a new story by looking at other variables in the data.
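As one sketch of such an aggregation (the records here are made up, standing in for count(borough, sighting_year) on the real data):

```r
library(tidyverse)

# Made-up sighting records standing in for the real data
sightings_toy <- tibble(
  borough       = rep(c("Brooklyn", "Manhattan"), each = 4),
  sighting_year = c(2014, 2014, 2021, 2021, 2014, 2021, 2021, 2021)
)

# Change in the number of sightings per borough between 2014 and 2021
change_2014_2021 <- sightings_toy %>%
  filter(sighting_year %in% c(2014, 2021)) %>%
  count(borough, sighting_year) %>%
  pivot_wider(names_from = sighting_year, values_from = n,
              names_prefix = "y", values_fill = 0) %>%
  mutate(change = y2021 - y2014)

change_2014_2021
```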
Recall that the whole point of inferential statistics is that we want to make a statement/inference about the population, given the sample statistics we have observed. In this case, our sample is the number of rat sightings by borough and year – surely there are more rats; we just don’t see them all! Some people argue that rat sightings are related to human activity – where there’s food and junk that humans create, rats will surely appear.
How closely do rat sightings track the human population in a borough?
Are the % of rat sightings in each borough related to that borough’s
population? We got data from the 2020 census and can create a small
tibble nyc_population with the relevant data.
# https://en.wikipedia.org/wiki/Boroughs_of_New_York_City
# NYC Boroughs, 2020 census data
# Bronx 1,472,654
# Brooklyn 2,736,074
# Manhattan 1,694,263
# Queens 2,405,464
# Staten Island 495,747
nyc_population <- tibble(
borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
population = c(1472654,2736074,1694263,2405464,495747)
)

Your task is to calculate Confidence Intervals (CIs) for the % of rat sightings in each borough and in each year between 2014 and 2022. This will give you CIs for the percentage of NYC rats that exist in that borough/year. You then need to compare those CIs with the human population % as shown in the following graph.
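One common approach (an assumption on my part – use whichever CI method your course covers) is the normal approximation for a proportion, p_hat ± 1.96 · sqrt(p_hat · (1 − p_hat) / N). A sketch with made-up counts for a single year:

```r
library(tidyverse)

# Made-up sighting counts for one year, standing in for count(borough)
counts <- tibble(
  borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
  n       = c(300, 450, 350, 250, 50)
)

# 95% CI for each borough's share of all sightings that year
ci <- counts %>%
  mutate(N     = sum(n),
         p_hat = n / N,
         se    = sqrt(p_hat * (1 - p_hat) / N),
         lower = p_hat - 1.96 * se,
         upper = p_hat + 1.96 * se)

ci
```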
To change the colour of the text in the title so it matches the
orange dot corresponding to the population %, you need to use the
ggtext package and the following code:
library(ggtext)
# your ggplot code +
labs(
  title = "Rat sightings don't always track a <span style='color:#FFA500'><b>borough's population</b></span>"
) +
theme(
plot.title.position = "plot",
plot.title = element_textbox_simple(size=16)) +
NULL

Besides the variables included in the rats dataframe, we
have also downloaded weather data for NYC which can be found at the
nyc_weather dataframe. Could it be that rats are
more active when it’s warmer, or when it rains?
Build a regression model that helps you explain the number of
sightings per day. You can use both the original data set, as well as
the nyc_weather, the variables of which are shown
below.
Rows: 3,734
Columns: 32
$ name <chr> "New York", "New York", "New York", "New York", "New …
$ datetime <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-01-04, 2010…
$ tempmax <dbl> 4.3, 1.1, -5.5, -1.0, -0.8, 1.3, 3.2, 0.9, -0.8, -1.8…
$ tempmin <dbl> 0.9, -8.0, -8.0, -6.8, -6.1, -3.0, -1.1, -4.8, -6.8, …
$ temp <dbl> 2.4, -3.5, -6.9, -4.7, -3.7, -1.6, 0.7, -1.3, -4.6, -…
$ feelslikemax <dbl> 4.1, -3.8, -13.2, -6.6, -6.1, -3.3, -0.7, 0.7, -5.1, …
$ feelslikemin <dbl> -1.8, -17.3, -17.3, -14.8, -12.8, -9.0, -6.6, -11.4, …
$ feelslike <dbl> 1.0, -10.5, -15.2, -11.3, -9.7, -7.1, -3.8, -5.5, -10…
$ dew <dbl> -0.4, -10.2, -14.4, -12.4, -11.3, -9.5, -6.2, -6.5, -…
$ humidity <dbl> 82.5, 60.2, 55.1, 55.3, 55.6, 55.0, 60.4, 68.5, 47.0,…
$ precip <dbl> 2.40, 0.18, 0.00, 0.00, 0.00, 0.00, 0.00, 0.68, 0.00,…
$ precipprob <dbl> 100, 100, 0, 0, 0, 0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0,…
$ precipcover <dbl> 16.67, 4.17, 0.00, 0.00, 0.00, 0.00, 0.00, 25.00, 0.0…
$ preciptype <chr> "rain,snow", "rain,snow", NA, NA, NA, NA, NA, "rain,s…
$ snow <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ snowdepth <dbl> 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ windgust <dbl> 37.1, 58.0, 60.6, 42.0, 40.7, 44.7, 48.2, 49.0, 40.7,…
$ windspeed <dbl> 15.5, 35.1, 36.2, 27.1, 26.5, 27.1, 26.4, 32.2, 23.6,…
$ winddir <dbl> 309.2, 295.4, 290.4, 294.4, 295.1, 299.2, 292.5, 281.…
$ sealevelpressure <dbl> 1011.7, 1004.9, 1000.6, 1005.7, 1006.4, 1006.3, 1011.…
$ cloudcover <dbl> 99.6, 79.3, 73.3, 28.1, 50.1, 55.5, 50.2, 78.4, 1.5, …
$ visibility <dbl> 10.6, 14.3, 15.3, 16.0, 16.0, 16.0, 16.0, 12.7, 16.0,…
$ solarradiation <dbl> 80.1, 49.9, 36.3, 69.2, 112.9, 100.1, 114.5, 68.0, 12…
$ solarenergy <dbl> 6.8, 4.1, 3.1, 5.9, 9.7, 8.7, 9.8, 5.9, 10.4, 10.6, 9…
$ uvindex <dbl> 4, 2, 1, 3, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 1, 5,…
$ severerisk <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sunrise <dttm> 2010-01-01 07:20:12, 2010-01-02 07:20:17, 2010-01-03…
$ sunset <dttm> 2010-01-01 16:39:17, 2010-01-02 16:40:08, 2010-01-03…
$ moonphase <dbl> 0.51, 0.53, 0.55, 0.60, 0.65, 0.70, 0.76, 0.81, 0.86,…
$ conditions <chr> "Snow, Rain, Overcast", "Snow, Rain, Partially cloudy…
$ description <chr> "Cloudy skies throughout the day with rain or snow cl…
$ icon <chr> "rain", "rain", "partly-cloudy-day", "partly-cloudy-d…
In essence, we want to see how many rat sightings we have per day in each borough. Please use the following code to join the two dataframes before you run your regression model.
rats_weather <- rats %>%
mutate(date = as.Date(created_date)) %>%
count(borough,date) %>%
left_join(nyc_weather, by = c("date" = "datetime")) %>%
mutate(month = month(date),
month_name = month(date, label=TRUE),
day = wday(date),
         day_of_week = wday(date, label=TRUE))

Use histograms or density plots to examine the distributions of
n, the number of rat sightings, and log(n).
Which variable should you use for the regression model? Why?
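A sketch of that comparison, using simulated right-skewed counts in place of the real daily n:

```r
library(tidyverse)

set.seed(42)
# Simulated daily counts with the right skew typical of count data;
# a stand-in for the real per-day sightings variable n
demo <- tibble(n = rpois(500, lambda = exp(rnorm(500, mean = 2, sd = 0.7))))

# Raw counts: a long right tail
ggplot(demo, aes(x = n)) +
  geom_histogram(bins = 30)

# Log scale (dropping zeros, since log(0) is undefined): closer to
# symmetric, which tends to suit a linear regression better
ggplot(filter(demo, n > 0), aes(x = log(n))) +
  geom_histogram(bins = 30)
```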
Fit a regression model called model1 with the following explanatory variable: temp.

- Is temp significant? Why?
- How much of the variability in the number of sightings does temp explain?

Fit a regression model called model2 with the following explanatory variables: temp and borough.

- Is temp significant? Why?
- Why is Bronx missing from the boroughs in our model?
- Did you model n versus log(n)? Remember that with a log transformation, the interpretation of the betas/coefficients is slightly different. You can read more about that here: FAQ: HOW DO I INTERPRET A REGRESSION MODEL WHEN SOME VARIABLES ARE LOG TRANSFORMED?

Further variables/questions to explore on your own
Our dataset has many more variables, so here are some ideas on how you can extend your analysis.

As you keep building your models, it makes sense to:
- Check the residuals, using performance::check_model(model_x). You will always have some deviation from normality, especially for very high values of n.
- Use car::vif(model_x) to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have collinear variables. A general guideline is that a VIF larger than 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.
- Create a summary table, using huxtable (https://mfa2023.netlify.app/example/modelling_side_by_side_tables/), that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.

There is a detailed assessment rubric below.
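Putting those checks together, a minimal sketch on simulated data (the diagnostic calls are shown as comments because they need the performance, car, and huxtable packages installed):

```r
library(tidyverse)

set.seed(1)
# Simulated stand-in for rats_weather: daily counts that rise with temperature
sim_days <- tibble(
  temp    = runif(400, -5, 30),
  borough = sample(c("Bronx", "Brooklyn", "Queens"), 400, replace = TRUE),
  n       = rpois(400, lambda = exp(1 + 0.05 * temp))
)

model1 <- lm(log(n + 1) ~ temp, data = sim_days)
model2 <- lm(log(n + 1) ~ temp + borough, data = sim_days)

# Diagnostics and reporting, as described above:
# performance::check_model(model2)   # residual checks
# car::vif(model2)                   # collinearity (VIF)
# huxtable::huxreg(model1, model2)   # side-by-side model table

summary(model2)$adj.r.squared
```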
It’s not enough for your code to run – it must be well documented and commented, so another person can read your work, reproduce your code, and easily understand what you are trying to achieve.